-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Fix longstanding regex interpreter bug around lazy loops with empty matches #120872
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This has been an issue in the interpreter forever, as far as I can tell. We've had multiple issues over the years all flagging problems with different symptoms that stem from the same core problem. The problem came down to how the regex interpreter handled lazy quantifiers over expressions that can match the empty string. When the interpreter reaches one of these lazy loops, it uses an internal instruction called `Lazybranchmark` to manage entering and potentially looping the subexpression. To keep track of loop state and capture boundaries, the interpreter uses two internal stacks, a grouping stack, that tracks positions relevant to capturing groups (e.g. where a group started), and a backtracking stack, that tracks states that are needed if the engine has to go back and try a different match. The bug occurred in the case when the subpattern inside the lazy loop matches nothing. In this case, the interpreter unconditionally pushed a placeholder onto the grouping stack. If the rest of the pattern then succeeded without backtracking through this loop, that extra placeholder remained on the grouping stack. This polluted the capture bookkeeping: later parts of the pattern popped that placeholder, treating it as a real start position, and shifted captures to the wrong place. The fix is to stop pushing onto the grouping stack when the loop matches empty. Instead, the interpreter records two things on the backtracking stack: the old group boundary and a flag indicating whether the grouping stack needs to be popped later. If the interpreter ends up backtracking through this lazy loop, it checks the flag: if a grouping stack entry was added earlier, it pops it; if not, it leaves the grouping stack untouched. This keeps the grouping stack and backtracking stack in sync in both forward and backtracking paths. As a result, empty lazy loops no longer leave stray entries on the grouping stack. This also prevents the unbounded stack growth that previously caused overflows or hangs on some patterns involving nested lazy quantifiers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR fixes a longstanding bug in the regex interpreter related to handling lazy quantifiers over expressions that can match empty strings. The bug caused incorrect capture group positioning and potential stack overflow issues when lazy loops matched empty strings.
- Fixed stack management in the regex interpreter to properly handle empty matches in lazy quantifiers
- Updated backtracking logic to track whether grouping stack entries need to be popped
- Added comprehensive test cases covering the various scenarios that previously failed
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| RegexInterpreter.cs | Fixed core bug by modifying stack management logic for lazy quantifiers with empty matches |
| Regex.Match.Tests.cs | Added test cases for lazy loops with empty matches that previously failed |
| Regex.MultipleMatches.Tests.cs | Added test cases for multiple match scenarios with lazy quantifiers |
...raries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexInterpreter.cs
Show resolved
Hide resolved
|
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed description, without it would be hard to understand the fix!
|
/ba-g deadletter |
This has been an issue in the interpreter forever, as far as I can tell. We've had multiple issues over the years all flagging problems with different symptoms that stem from the same core problem.
Fixes #43314
Fixes #58786
Fixes #63385
Fixes #111051
Fixes #114626
The problem came down to how the regex interpreter handled lazy quantifiers over expressions that can match the empty string. When the interpreter reaches one of these lazy loops, it uses an internal instruction called
Lazybranchmarkto manage entering and potentially looping the subexpression. To keep track of loop state and capture boundaries, the interpreter uses two internal stacks, a grouping stack, that tracks positions relevant to capturing groups (e.g. where a group started), and a backtracking stack, that tracks states that are needed if the engine has to go back and try a different match. The bug occurred in the case when the subpattern inside the lazy loop matches nothing. In this case, the interpreter unconditionally pushed a placeholder onto the grouping stack. If the rest of the pattern then succeeded without backtracking through this loop, that extra placeholder remained on the grouping stack. This polluted the capture bookkeeping: later parts of the pattern popped that placeholder, treating it as a real start position, and shifted captures to the wrong place.The fix is to stop pushing onto the grouping stack when the loop matches empty. Instead, the interpreter records two things on the backtracking stack: the old group boundary and a flag indicating whether the grouping stack needs to be popped later. If the interpreter ends up backtracking through this lazy loop, it checks the flag: if a grouping stack entry was added earlier, it pops it; if not, it leaves the grouping stack untouched. This keeps the grouping stack and backtracking stack in sync in both forward and backtracking paths. As a result, empty lazy loops no longer leave stray entries on the grouping stack. This also prevents the unbounded stack growth that previously caused overflows or hangs on some patterns involving nested lazy quantifiers.